Longitudinal datasets contain repeated measurements over time. They can arise from both observational studies and intervention trials, but we will focus on observational data here.
Analysis of longitudinal data has several advantages, most notably the ability to leverage within-person comparisons alongside between-person comparisons. However, it also raises challenges that must be accounted for during modeling, chiefly the correlation between repeated measurements from the same individual.
Objective: To understand how longitudinal data can leverage within-person comparisons to improve the estimation of effects in complex trait epidemiology.
LMMs account for the expected similarity, or correlation, between specific observations in a dataset. The simplest way to account for this correlation is via a random intercept, which gives every individual a slightly different mean. Under the hood, this amounts to fitting a single correlation describing the similarity of all values coming from the same individual. This means that (1) the correlation between two measurements from person A is assumed to equal the correlation between two measurements from person B, and (2) all measurements from a given person are assumed to be exchangeable, meaning that their similarity doesn't depend on their ordering or on how far apart in time they occur.
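This implied correlation structure (often called "compound symmetry") can be verified with a quick base-R simulation, a sketch separate from the analysis dataset below: generate outcomes as a shared person-level intercept plus independent noise, and check that every pair of timepoints is equally correlated.

```r
set.seed(42)

N <- 5000                           # individuals
var_b <- 0.8                        # between-person (random intercept) variance
var_e <- 0.2                        # within-person (residual) variance

b <- rnorm(N, sd = sqrt(var_b))     # one intercept per person
y <- sapply(1:4, function(k) b + rnorm(N, sd = sqrt(var_e)))  # 4 timepoints

# Under compound symmetry, any two timepoints share the same correlation,
# var_b / (var_b + var_e) = 0.8, no matter how far apart they are:
cor(y[, 1], y[, 2])   # adjacent timepoints, approx. 0.8
cor(y[, 1], y[, 4])   # distant timepoints, also approx. 0.8
```

An autoregressive structure, by contrast, would show the second correlation decaying below the first.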
Two extensions of this implicit model (that we won’t discuss further here) are: more complicated covariances (e.g., autoregressive models that expect the correlation between two measurements to decrease the farther apart they are in time) and random slopes (in which individuals have unique covariate effects, not just intercepts).
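For reference only (since we won't use them here), both extensions have standard formula syntax. The AR(1) example below uses the nlme package, which this tutorial does not load, so treat it as a sketch:

```r
# Random slopes in lme4: each person gets their own intercept AND time effect
f_slope <- Y ~ t + G + E + (1 + t | id)

# AR(1) residual correlation is supported in nlme (not lme4), e.g.:
# nlme::lme(Y ~ t + G + E, random = ~ 1 | id,
#           correlation = nlme::corAR1(form = ~ t | id), data = long_df)
```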
Next, we introduce the LMM that will be used in the simulations below, where \(i\) indexes individuals and \(j\) indexes timepoints:
\(Y_{ij} = \beta_0 + \beta_1 t_{ij} + \beta_2 g_i + \beta_3 e_{ij} + X_{ij}^T\beta_C + b_{1i} + \epsilon_{ij}\)
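Here \(b_{1i}\) is the person-specific random intercept and \(\epsilon_{ij}\) the residual. In lme4's formula syntax this model maps directly onto the call used throughout the tutorial (with the single simulated covariate C standing in for \(X_{ij}\)):

```r
# beta_0 (intercept) + beta_1*t + beta_2*G + beta_3*E + beta_C*C -> fixed effects
# b_1i                                                           -> (1 | id)
# epsilon_ij                                                     -> residual (implicit)
f <- Y ~ t + G + E + C + (1 | id)
```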
library(tidyverse)
library(longsim)
library(lme4)
library(lmerTest)
library(knitr)       # kable()
library(kableExtra)  # kable_styling()
A longitudinal dataset will be simulated using the longsim package. Here, we create a wrapper function that calls various longsim simulation functions (simulate_exposure(), simulate_covariate(), etc.), converts each dataset into “long” format, and joins them to create a single dataset ready for analysis.
generate_dataset <- function(
N = 1000, # Number of individuals
K = 4, # Number of timepoints
maf = 0.25, # Minor allele frequency
icc_C = 0.8, # ICC for covariate
var_e_C = 1, # Error variance for covariate
beta_CE = 0, # C-E effect size
icc_E = 0.5, # ICC for exposure
var_e_E = 1, # Error variance for exposure
beta_tY = -1, # Slope with respect to time
beta_GY = 0.5, # Genotype effect size
beta_CY = 0, # C-Y effect size
beta_EY = 1, # E-Y effect size
icc_Y = 0.8, # ICC for outcome (across repeated measures)
var_e_Y = 1 # Error variance for outcome
) {
simulate_long_data( # High-level function from the longsim package
N, K,
maf,
icc_C, var_e_C,
beta_CE, icc_E, var_e_E,
beta_tY, beta_GY, beta_CY, beta_EY, icc_Y, var_e_Y
)
}
set.seed(1)
long_df <- generate_dataset()
Using the simulated dataset, we can first generate a few visualizations to gain some familiarity with its attributes and patterns.
long_df %>%
slice(1:100) %>%
ggplot(aes(x=t, y=Y, group=id)) +
geom_point() +
geom_line()
long_df %>%
slice(1:100) %>%
ggplot(aes(x=G, y=Y)) +
geom_boxplot(aes(group=G)) +
geom_point() +
scale_x_continuous(breaks=0:2)
lmm_fit <- lmer(Y ~ G + E + (1|id), data=long_df)
long_df$Y_pred <- predict(lmm_fit)
long_df %>%
filter(id %in% 1:5) %>%
ggplot(aes(x=E, y=Y, color=id)) +
geom_point(data=slice(long_df, 1:100), color="gray") +
geom_point() +
geom_line(aes(y=Y_pred), linewidth=1)
We see trends in Y based on time, genotype, and exposure. To plot person-specific E-Y regression lines, we fit a linear mixed model incorporating a random intercept, then use the predict() method to generate predicted values that incorporate this heterogeneity in the intercept.
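A note on the prediction step: for lmer fits, the default `predict()` method (`predict.merMod()`) includes each person's estimated random intercept, which is what makes the fitted lines person-specific. Passing `re.form = NA` instead drops the random effects and returns population-level predictions. A self-contained sketch on a small hypothetical dataset (not the tutorial's long_df):

```r
library(lme4)

# Small hypothetical dataset, purely to illustrate the predict() options
set.seed(1)
toy <- data.frame(id = factor(rep(1:20, each = 4)), E = rnorm(80))
toy$Y <- rnorm(20)[as.integer(toy$id)] + toy$E + rnorm(80, sd = 0.5)
fit <- lmer(Y ~ E + (1 | id), data = toy)

pred_subj <- predict(fit)                 # fixed effects + random intercepts
pred_pop  <- predict(fit, re.form = NA)   # fixed effects only
```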
For most longitudinal data analyses, data should be converted into "long" format, with one row per timepoint (i.e., per observation) rather than one row per person. The columns of this dataset include a person-specific ID and a timepoint (each row should have a unique combination of these two), time-varying fields (such as the outcome Y and exposure E), and time-constant fields (such as genotype G). The primary goal of this analysis will be to estimate the effect of the time-varying exposure E on the outcome Y; the true value of this effect is simulated as 1 unless otherwise noted.
print(head(long_df))
## # A tibble: 6 × 8
## id timept Y t G E C Y_pred
## <fct> <fct> <dbl> <int> <int> <dbl> <dbl> <dbl>
## 1 1 t1 -1.36 1 0 -0.823 0.969 -3.10
## 2 1 t2 -1.74 2 0 -0.621 1.64 -2.90
## 3 1 t3 -4.58 3 0 -1.15 2.19 -3.43
## 4 1 t4 -3.36 4 0 -0.532 1.86 -2.81
## 5 2 t1 -1.19 1 0 0.807 0.428 -2.01
## 6 2 t2 -1.07 2 0 1.41 0.792 -1.40
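If a dataset instead arrives in "wide" format (one row per person, one column per timepoint), tidyr's `pivot_longer()` performs the conversion. A sketch on a small hypothetical wide dataset (not the simulated one above):

```r
library(tidyr)  # loaded via tidyverse above

# Hypothetical wide data: one row per person, one Y column per timepoint
wide_df <- data.frame(
  id   = 1:3,
  G    = c(0, 1, 2),   # time-constant
  Y_t1 = c(-1.4, -1.2, 0.3),
  Y_t2 = c(-1.7, -1.1, 0.5)
)

long_example <- pivot_longer(
  wide_df,
  cols = starts_with("Y_"),
  names_to = "timept", names_prefix = "Y_",
  values_to = "Y"
)
# Result: 6 rows, one per unique (id, timept) combination
```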
Next, we will fit our first linear mixed model (LMM) on the dataset. Though multiple packages and functions exist for fitting LMMs, we will use the lmer() function from the lme4 package (with lmerTest loaded so that the summary includes Satterthwaite-based p-values). Note the syntax used to include a random intercept per person in the regression formula: "(1|id)".
Using this primary dataset, what do the results look like?
primary_lmm_fit <- lmer(Y ~ t + G + E + (1|id), data=long_df)
summary(primary_lmm_fit)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method ['lmerModLmerTest']
## Formula: Y ~ t + G + E + (1 | id)
## Data: long_df
##
## REML criterion at convergence: 7771.5
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.10938 -0.59338 0.00226 0.59228 3.11933
##
## Random effects:
## Groups Name Variance Std.Dev.
## id (Intercept) 0.7738 0.8796
## Residual 0.2023 0.4498
## Number of obs: 4000, groups: id, 1000
##
## Fixed effects:
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) 1.077e-02 3.984e-02 1.396e+03 0.270 0.787
## t -1.003e+00 6.361e-03 2.998e+03 -157.706 <2e-16 ***
## G 4.439e-01 4.586e-02 9.981e+02 9.678 <2e-16 ***
## E 1.001e+00 1.142e-02 3.575e+03 87.685 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr) t G
## t -0.399
## G -0.567 0.000
## E -0.019 0.014 0.011
Two notes on the LMM summary above: (1) the random-effects section reports variance components rather than coefficient estimates, here the variance of the per-person intercepts ("id") and the residual variance; and (2) because lmerTest is loaded, the fixed-effects table includes Satterthwaite-based degrees of freedom and p-values, which base lme4 does not provide.
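The variance components in the summary above also let us recover the intraclass correlation coefficient (ICC), the proportion of outcome variance attributable to stable between-person differences, which is the quantity varied throughout the simulations below:

```r
# ICC = between-person variance / (between-person + residual variance)
icc <- function(var_between, var_residual) {
  var_between / (var_between + var_residual)
}

# Variance components from the primary LMM fit above:
icc(0.7738, 0.2023)   # approx. 0.79, close to the simulated icc_Y = 0.8
```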
First, we define some helper functions to facilitate comparison of results across regression fits. The goal will be to produce (1) tables displaying both fixed effects estimates (formatted as “estimate (standard error)”) and variance estimates for the random effects and residual variance, and (2) plots comparing estimates and confidence intervals for the E effect across models.
extract_lmm_estimates <- function(fit) {
# Given a linear model or LMM fit, extract fixed effects and variance estimates
FEs <- summary(fit)$coefficients %>%
as_tibble(rownames="term") %>%
mutate(across(c(Estimate, `Std. Error`), ~ round(., 3)),
estimate = paste0(Estimate, " (", `Std. Error`, ")")) %>%
select(term, estimate)
VCs <- if (inherits(fit, "lm")) {
residual_variance <- round(summary(fit)$sigma^2, 3)
tibble(term = "Variance - Residual",
estimate = as.character(residual_variance))
} else {
as.data.frame(VarCorr(fit)) %>%
mutate(term = paste0("Variance - ", grp),
variance = as.character(round(vcov, 3))) %>%
select(term, estimate=variance)
}
bind_rows(FEs, VCs) %>%
setNames(c("Term", "Estimate (SE)"))
}
compare_lmm_estimates <- function(lmm_list) {
# Given a list of two or more model estimates from extract_lmm_estimates(),
# create a data frame for comparison of results
estimates_list <- lapply(lmm_list, extract_lmm_estimates)
comparison_df <- reduce(estimates_list, function(x, y) {
full_join(x, y, by="Term")
})
setNames(comparison_df, c("Term", names(lmm_list)))
}
print_comparison_tbl <- function(tbl, caption="") {
tbl %>%
kable(caption=caption) %>%
kable_styling(full_width=FALSE)
}
plot_comparison <- function(tbl, fct_levels=NULL) {
if (is.null(fct_levels)) fct_levels <- names(tbl)[-1]
tbl %>%
pivot_longer(-Term, names_to="model", values_to="value") %>%
filter(!is.na(value),
grepl("E", Term)) %>%
mutate(
model = factor(model, levels=fct_levels),
estimate = as.numeric(gsub(" \\(.*", "", value)),
SE = as.numeric(gsub(".*\\(|\\)", "", value))
) %>%
ggplot(aes(x=model, y=estimate)) +
geom_point() +
geom_errorbar(aes(ymin = estimate - 1.96 * SE, ymax = estimate + 1.96 * SE),
width=0.1) +
geom_hline(yintercept=1, linetype="dashed", color="gray") +
labs(x="Model", y=expression("E effect (" * beta[E] * ") estimate (95% CI)"))
}
We have various options for modeling this dataset using standard linear regression (LR):
# LR using first timepoint
lm_timept1_fit <- lm(Y ~ G + E, data=filter(long_df, timept == "t1"))
# LR using mean values
lm_mean_fit <- lm(Y ~ G + E,
data=summarise(group_by(long_df, id),
Y=mean(Y), G=G[1], E=mean(E)))
# LR using all datapoints
lm_all_fit <- lm(Y ~ G + E, data=long_df)
lm_comparison_df <- compare_lmm_estimates(list(
LMM = primary_lmm_fit,
`LR - 1st timept` = lm_timept1_fit,
`LR - Mean values` = lm_mean_fit,
`LR - Ignore clustering` = lm_all_fit
))
print_comparison_tbl(lm_comparison_df)
| Term | LMM | LR - 1st timept | LR - Mean values | LR - Ignore clustering |
|---|---|---|---|---|
| (Intercept) | 0.011 (0.04) | -0.969 (0.04) | -2.496 (0.037) | -2.497 (0.03) |
| t | -1.003 (0.006) | NA | NA | NA |
| G | 0.444 (0.046) | 0.41 (0.05) | 0.443 (0.046) | 0.444 (0.038) |
| E | 1.001 (0.011) | 0.952 (0.031) | 0.978 (0.036) | 0.997 (0.024) |
| Variance - id | 0.774 | NA | NA | NA |
| Variance - Residual | 0.202 | 0.981 | 0.825 | 2.234 |
plot_comparison(lm_comparison_df)
In summary, LMMs provide the flexibility to use all available longitudinal data effectively, while the standard LR approaches either discard information (the first-timepoint and mean-value analyses) or produce effect estimates with poorly calibrated SEs (ignoring clustering), risking false positives and/or false negatives.
LMMs have access to two sources of information: cross-sectional (between-person) comparisons and longitudinal (within-person) comparisons. To see when this might come in handy, we will simulate a similar dataset that includes a time-constant confounder of the E-Y relationship.
Then, assuming the true confounder C is unknown, unmeasured, or badly measured, we can test a series of models:
long_df <- generate_dataset(icc_C=1, beta_CE=1, beta_CY=1) # ICC_C of 1 means no change over time
long_diff_df <- long_df %>%
group_by(id) %>%
mutate(E_bl = E[timept == "t1"],
Y_bl = Y[timept == "t1"],
E_diff = E - E[timept == "t1"]) %>%
ungroup()
confounded_adj_lmm_fit <- lmer(Y ~ t + G + E + C + (1|id), data=long_diff_df)
confounded_adj_lr_fit <- lm(Y ~ t + G + E + C, data=long_diff_df)
confounded_unadj_lr_fit <- lm(Y ~ t + G + E, data=long_diff_df)
confounded_unadj_lmm_fit <- lmer(Y ~ t + G + E + (1|id), data=long_diff_df)
confounded_decomp_lmm_fit <- lmer(Y ~ t + G + E_bl + E_diff + (1|id), data=long_diff_df)
confounded_adj_bl_E_fit <- lmer(Y ~ t + G + E + E_bl + (1|id), data=long_diff_df)
decomp_comparison_df <- compare_lmm_estimates(list(
`Adjusted\nLMM` = confounded_adj_lmm_fit,
`Adjusted\nLR` = confounded_adj_lr_fit,
`Unadjusted\nLR` = confounded_unadj_lr_fit,
`Unadjusted\nLMM` = confounded_unadj_lmm_fit,
`Decomposed\nLMM` = confounded_decomp_lmm_fit,
`BL E-adjusted\nLMM` = confounded_adj_bl_E_fit
))
print_comparison_tbl(decomp_comparison_df)
| Term | Adjusted LMM | Adjusted LR | Unadjusted LR | Unadjusted LMM | Decomposed LMM | BL E-adjusted LMM |
|---|---|---|---|---|---|---|
| (Intercept) | 0.025 (0.041) | 0.025 (0.04) | 0.026 (0.049) | 0.013 (0.056) | 0.015 (0.05) | 0.015 (0.05) |
| t | -1.009 (0.006) | -1.009 (0.014) | -1.015 (0.017) | -1.009 (0.006) | -1.009 (0.006) | -1.009 (0.006) |
| G | 0.565 (0.046) | 0.566 (0.025) | 0.578 (0.03) | 0.545 (0.065) | 0.585 (0.057) | 0.585 (0.057) |
| E | 1.011 (0.011) | 1.024 (0.016) | 1.511 (0.013) | 1.079 (0.011) | NA | 1.028 (0.011) |
| C | 0.985 (0.031) | 0.971 (0.022) | NA | NA | NA | NA |
| Variance - id | 0.775 | NA | NA | 1.624 | 1.246 | 1.246 |
| Variance - Residual | 0.201 | 0.974 | 1.447 | 0.203 | 0.201 | 0.201 |
| E_bl | NA | NA | NA | NA | 1.52 (0.026) | 0.492 (0.027) |
| E_diff | NA | NA | NA | NA | 1.028 (0.011) | NA |
decomp_comparison_df %>%
pivot_longer(-Term, names_to="model", values_to="value") %>%
filter(!is.na(value),
grepl("E", Term)) %>%
mutate(model = factor(model, levels=names(decomp_comparison_df)[-1]),
estimate = as.numeric(gsub(" \\(.*", "", value)),
SE = as.numeric(gsub(".*\\(|\\)", "", value))) %>%
ggplot(aes(x=model, y=estimate, color=Term)) +
geom_point() +
geom_errorbar(aes(ymin = estimate - 1.96 * SE, ymax = estimate + 1.96 * SE),
width=0.1) +
geom_hline(yintercept=1, linetype="dashed", color="gray") +
labs(x="Model", y=expression("E effect (" * beta[E] * ") estimate (95% CI)"))
What do we observe?
What happens if we cut the number of timepoints in half in the context of this time-constant confounder?
long_diff_df <- filter(long_diff_df, timept %in% c("t1", "t2"))
confounded_smaller_K_unadj_lmm_fit <- lmer(Y ~ t + G + E + (1|id), data=long_diff_df)
smaller_K_comparison_df <- compare_lmm_estimates(list(
`Decomposed LMM (K = 4)` = confounded_decomp_lmm_fit,
`Unadj. LMM (K = 4)` = confounded_unadj_lmm_fit,
`Unadj. LMM (K = 2)` = confounded_smaller_K_unadj_lmm_fit
))
plot_comparison(smaller_K_comparison_df)
With a smaller K, there is less longitudinal information for the model to use in estimating within-person effects, so its estimate will be weighted relatively more towards the (confounded) between-person estimate.
Technical note: The “time-constant” confounder doesn’t actually have to be constant over time to see this type of benefit from the LMM! The important part is that the confounder is clustered to some degree such that people vary in their typical value for that trait. For example, socioeconomic status or health-consciousness may not be strictly constant in a given person, yet most people will tend to have similar values for these traits over time.
To see this, we can simulate a series of confounded datasets, varying the ICC for C from 0 (no within-person correlation) to 1 (time-constant).
long_df <- generate_dataset(icc_C=0, beta_CE=1, beta_CY=1)
confounded_icc0_unadj_lmm_fit <- lmer(Y ~ t + G + E + (1|id), data=long_df)
long_df <- generate_dataset(icc_C=0.5, beta_CE=1, beta_CY=1)
confounded_icc0.5_unadj_lmm_fit <- lmer(Y ~ t + G + E + (1|id), data=long_df)
long_df <- generate_dataset(icc_C=1, beta_CE=1, beta_CY=1)
confounded_icc1_unadj_lmm_fit <- lmer(Y ~ t + G + E + (1|id), data=long_df)
icc_C_comparison_df <- compare_lmm_estimates(list(
`Unadj. LMM (ICC_C = 0)` = confounded_icc0_unadj_lmm_fit,
`Unadj. LMM (ICC_C = 0.5)` = confounded_icc0.5_unadj_lmm_fit,
`Unadj. LMM (ICC_C = 1)` = confounded_icc1_unadj_lmm_fit
))
plot_comparison(icc_C_comparison_df)
Despite the magnitude of confounding being identical in each of the above models, the estimates are very different. As within-person consistency in the confounder value increases (i.e., higher ICC), the random intercept is better able to capture and adjust for its effect, allowing the model to recover an estimate closer to the true value.
long_df <- generate_dataset(icc_Y=0.2)
lower_icc_Y_lmm_fit <- lmer(Y ~ t + G + E + (1|id), data=long_df)
lower_icc_Y_lr_fit <- lm(Y ~ t + G + E, data=long_df)
lower_icc_comparison_df <- compare_lmm_estimates(list(
Primary = primary_lmm_fit,
`Primary - LR` = lm_all_fit,
`Less Y clustering - LMM` = lower_icc_Y_lmm_fit,
`Less Y clustering - LR` = lower_icc_Y_lr_fit
))
print_comparison_tbl(lower_icc_comparison_df)
| Term | Primary | Primary - LR | Less Y clustering - LMM | Less Y clustering - LR |
|---|---|---|---|---|
| (Intercept) | 0.011 (0.04) | -2.497 (0.03) | 0.024 (0.04) | 0.024 (0.04) |
| t | -1.003 (0.006) | NA | -0.991 (0.012) | -0.991 (0.014) |
| G | 0.444 (0.046) | 0.444 (0.038) | 0.495 (0.033) | 0.495 (0.025) |
| E | 1.001 (0.011) | 0.997 (0.024) | 0.993 (0.017) | 0.994 (0.016) |
| Variance - id | 0.774 | NA | 0.22 | NA |
| Variance - Residual | 0.202 | 2.234 | 0.769 | 0.989 |
plot_comparison(lower_icc_comparison_df)
When Y is less clustered, less of its variability can be attributed by the LMM to the random intercept. So, as seen in the two estimates on the right, the LMM loses its "advantage," and its SEs become similar to those from LR.
long_df <- generate_dataset(icc_E=0.1)
low_icc_E_lmm_fit <- lmer(Y ~ t + G + E + (1|id), data=long_df)
long_df <- generate_dataset(icc_E=0.9)
high_icc_E_lmm_fit <- lmer(Y ~ t + G + E + (1|id), data=long_df)
icc_E_comparison_df <- compare_lmm_estimates(list(
`Less-clustered E` = low_icc_E_lmm_fit,
Primary = primary_lmm_fit,
`More-clustered E` = high_icc_E_lmm_fit
))
plot_comparison(icc_E_comparison_df)
The more clustered E is (moving to the right in the plot), the more similar a person’s E value is to their value at other timepoints. This means the model has less within-person variability in E to use in estimating its effect, so the effect estimate remains similar but with less confidence (higher SEs).
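The within-person variability that drives this behavior can be inspected directly. In the base-R sketch below (hypothetical data mirroring the ICC settings above, not the longsim output), as the ICC of E increases, the variance of E around each person's own mean shrinks, and it is exactly this within-person variance that the LMM leans on:

```r
set.seed(42)
N <- 2000; K <- 4

simulate_E <- function(icc_E) {
  b <- rep(rnorm(N, sd = sqrt(icc_E)), each = K)   # person-level component
  e <- rnorm(N * K, sd = sqrt(1 - icc_E))          # occasion-level noise
  data.frame(id = rep(1:N, each = K), E = b + e)
}

within_var <- function(df) {
  person_means <- ave(df$E, df$id)                 # each person's mean E
  mean((df$E - person_means)^2)                    # within-person variance
}

within_var(simulate_E(0.1))   # lots of within-person variability
within_var(simulate_E(0.9))   # very little
```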
set.seed(1)
long_df <- generate_dataset()
long_df %>%
ggplot(aes(x=t, y=Y, group=id)) +
geom_point() +
geom_line()
long_df %>%
ggplot(aes(x=t, y=Y, group=id)) +
geom_point() +
geom_line() +
geom_hline(yintercept=mean(long_df$Y), linetype="dashed", color="red")
long_df %>%
ggplot(aes(x=t, y=Y, group=id)) +
geom_point() +
geom_line() +
geom_smooth(aes(group=1), method="lm", formula="y ~ x", se=FALSE,
linetype="dashed", color="red")
long_df %>%
ggplot(aes(x=G, y=Y)) +
geom_point()
long_df %>%
ggplot(aes(x=G, y=Y)) +
geom_point() +
geom_smooth(aes(group=1), method="lm", formula="y ~ x", se=FALSE,
linetype="dashed", color="red")
long_df %>%
ggplot(aes(x=E, y=Y)) +
geom_point()
long_df %>%
ggplot(aes(x=E, y=Y)) +
geom_point() +
geom_smooth(aes(group=1), method="lm", formula="y ~ x", se=FALSE,
linetype="dashed", color="red")